Exploratory analysis and regression (in R)

Thomas Lumley

6-7 July 2016

Outline

Why we care

Analytic modelling of survey data is important, over and above just tabulating

People will do it better if they can use most of the same approaches they use for cohort or panel data.

Tools for exploratory analysis are still a gap in many texts and most software. 

Other variance estimation

Supply replicate weights instead of design meta-data: use svrepdesign()

Or, create them from a design object: use as.svrepdesign() (JK1, JKn, BRR, bootstraps)

Various PPS linearisation estimators also available
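A minimal sketch of the two routes to replicate weights, using the api example data shipped with the survey package (the design object here is illustrative, not from the talk):

```r
library(survey)
data(api)

# A stratified design with full design meta-data
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Route 1: derive replicate weights from the design object
rstrat <- as.svrepdesign(dstrat, type = "JKn")          # stratified jackknife
bstrat <- as.svrepdesign(dstrat, type = "bootstrap",    # survey bootstrap
                         replicates = 100)

# Estimation works the same way on either kind of object
svymean(~enroll, rstrat)
```

If the data come with replicate weights already supplied (as many public-use files do), `svrepdesign()` takes the analysis weights and the matrix of replicate weights directly instead of the design meta-data.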

Exploratory graphics

Summary statistics:

barplot(svyby(~enroll,~stype,design=dstrat,svymean),
         xlab="Mean school size",col="goldenrod")

Forest plots

library(rmeta)
means<-svyby(~enroll,~stype,design=dstrat,svymean)
metaplot(coef(means),SE(means),labels=names(coef(means)),
         nn=1,boxsize=0.2,
         xlab="Mean school size",ylab="School type")

Distributions

Usually choose default bandwidth as if SRS of same size.

CDF

cdf.est<-svycdf(~enroll+api00, dstrat)
cdf.pop<-ecdf(apipop$enroll)
cdf.samp<-ecdf(apistrat$enroll)
plot(cdf.pop,main="Population vs sample", xlab="Enrollment")
lines(cdf.samp,col="red",lwd=2)
lines(cdf.est[["enroll"]],col.points="forestgreen",lwd=2)

Boxplots

svyboxplot(enroll~stype, design=dclus2, col="orange", all.outliers=TRUE)

Density estimators

plot(svysmooth(~api00, dclus2))

Density estimators

svyhist(~api00, dclus2, col="orange", main="")

NHANES data: two 2-year waves

Complete code on github/tslumley/regression-paper

nhanes<- read.csv("combined-data.csv")
des<-svydesign(id=~SDMVPSU,strata=~SDMVSTRA,weights=~fouryearwt,
   nest=TRUE, data=subset(nhanes, !is.na(WTDRD1)))
des<-update(des, sodium=DR1TSODI/1000, potassium=DR1TPOTA/1000)
des<-update(des, namol=sodium/23, kmol=potassium/39)
des
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (60) clusters.
## update(des, namol = sodium/23, kmol = potassium/39)
dim(des)
## [1] 18383   100

Scatterplots: NHANES (14000 points)

plot(BPXDAR~RIDAGEYR,data=nhanes, xlab="Age",ylab="Diastolic",pch=19)

Scatterplots

Hard because ordinary scatterplots overplot badly at survey sample sizes, and equal-sized points ignore the sampling weights.

Approaches: make ink proportional to weight (alpha-blending), or bin the points and display weighted counts (hexagonal binning).

Alpha-blending

Use partially-transparent points:

NHANES: 14000 points

svyplot(BPXDAR~RIDAGEYR,design=des, xlab="Age",ylab="Diastolic",style="trans",pch=19)

NHANES: 14000 points

svyplot(BPXSAR~RIDAGEYR,design=des, xlab="Age",ylab="Systolic",style="trans",pch=19,
        basecol=function(df){ifelse(df$RIAGENDR==1,"blue","pink")})

Hexagonal binning

(Dan Carr, 1987)

NHANES, again

svyplot(BPXDAR~RIDAGEYR,design=des, xlab="Age",ylab="Diastolic",style="hex",legend=0)

Scatterplot smoothers

We only need the curve, not a standard error estimate, so this is easy
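A minimal sketch with the api cluster sample from the survey package (the design object here is illustrative, not from the talk): `svysmooth()` fits a design-weighted smoother, and plotting it draws just the curve, with no standard-error band.

```r
library(survey)
data(api)

# Two-stage cluster sample: school districts, then schools
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)

# Weighted scatterplot smoother of 2000 API score against 1999 score
smth <- svysmooth(api00 ~ api99, design = dclus2)
plot(smth, xlab = "1999 API", ylab = "2000 API")
```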

Conditioning plots

Show relationships in more than two dimensions by plotting Y~X conditioned on a range of Z

Blood pressure, age, and sex

svycoplot(BPXSAR~BPXDAR|cut(RIDAGEYR,c(0,21,40,60,100))*factor(RIAGENDR,labels=c("M","F")),design=des, xlab="Diastolic",ylab="Systolic",style="hex",xbins=30)

Blood pressure, age, and sex

svycoplot(BPXSAR~BPXDAR|cut(RIDAGEYR,c(0,21,40,60,100)),design=des, xlab="Age",ylab="Systolic",style="trans",
        basecol=function(df){ifelse(df$RIAGENDR==1,"blue","pink")})

And a new idea: (so new there’s a bug in the CRAN version)

library(hextri)
dd<-model.frame(des)[,c("RIDAGEYR","BPXSAR","fouryearwt","RIAGENDR")]
dd<-subset(na.omit(dd), BPXSAR>50 & BPXSAR<200)
hextri(BPXSAR~RIDAGEYR, weights=fouryearwt, class=RIAGENDR,nbins=40,
       style="size",col=c("orange","purple"),data=dd)

Regression models

Instead of solving the Normal equations \[X^T(Y-\mu)=0\] we solve weighted Normal equations \[X^TW(Y-\mu)=0\] where \(W=\mathrm{diag}(\pi_i^{-1})\)

Under no assumptions about \(Y|X\), \(\hat\beta\) is consistent for the population least-squares line.

Variances from delta-method (‘sandwich’) or resampling.
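A sketch of the two variance routes on the same model, again using the api example data (not the talk's NHANES model): linearisation gives the sandwich standard errors directly, and converting the design to replicate weights gives resampling standard errors, which should be similar.

```r
library(survey)
data(api)

dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Delta-method ('sandwich') standard errors
m_lin <- svyglm(api00 ~ ell + meals, design = dstrat)

# Jackknife replicate-weight standard errors for the same model
m_rep <- svyglm(api00 ~ ell + meals,
                design = as.svrepdesign(dstrat, type = "JKn"))

cbind(linearised = SE(m_lin), jackknife = SE(m_rep))
```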

Assumptions in regression

For inference about population associations:

Do we need weights?

If \(E[Y|X=x]=x\beta\), so the model is correctly specified, and the weights are independent of \(Y\) given \(X\), then unweighted estimation is also consistent (and typically more precise).

Bias/variance tradeoff: the larger the survey, the more we care about bias, so the more we want to include the weights

Do we need all the weights?

If \(E[Y|X=x]=x\beta\), but weights do not just depend on \(x\), can replace weights by \(w_i = g(x_i)/\pi_i\) for any function \(g\).

Optimal \(g()\) minimises the coefficient of variation of \(w_i\), not far from \(g(x)=1/E[\pi_i^{-1}|X=x]\), so that the weights average to about 1 at each \(x\).

(Pfefferman & Sverchkov, 1999; Brumback, Hernán, and Robins, 2000: “stabilised weights”)

Useful when \(\pi_i\) depends strongly on \(x\), weakly on \(Y\).
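A hedged sketch of one common construction (the stabilised weights of Pfeffermann & Sverchkov); the variable names `samp`, `pi`, and `x` are illustrative, not from the talk. The idea is to divide the design weight \(1/\pi_i\) by a working estimate of its conditional mean given \(x\), so the rescaled weights vary much less:

```r
# samp: sample data frame; samp$pi: inclusion probabilities;
# samp$x: the covariate the weights mostly depend on
inv_pi <- 1 / samp$pi

# Working regression model for E[1/pi | X = x], fitted on the sample
fit_g <- lm(inv_pi ~ x, data = samp)

# Stabilised weight: g(x)/pi, with mean roughly 1 at each x
samp$w_stab <- inv_pi / fitted(fit_g)

# Use w_stab in place of 1/pi, e.g. svydesign(..., weights = ~w_stab)
```

Any reasonable working model for \(g()\) keeps consistency, since the result holds for any function of \(x\); a better model just gives smaller weight variation.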

Other regression models

Same principle for generalised linear models: weighted likelihood equation \[D^TV^{-1}W(Y-\mu)=0\]

Similar principle for Cox model (Binder, 1992), loglinear models (Rao & Scott, 1981), proportional odds model, parametric survival models, etc, etc.

Main exception is mixed models: these model pairs of observations, so can’t just reweight with \(\pi_i\).

R functions

All take a model formula and a survey object, plus other options.

NHANES data example

Data on blood pressure and diet from the US NHANES health survey.

Complex four-stage survey, but the public-use data approximate it by a two-stage design.

nhanesdes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, 
      weights=~fouryearwt, nest=TRUE,
      data=subset(nhanes, !is.na(WTDRD1)))
nhanesdes <- update(nhanesdes, sodium=DR1TSODI/1000,
      potassium=DR1TPOTA/1000)
nhanesdes <- update(nhanesdes, namol = sodium/23, 
      kmol= potassium/39)
nhanesdes <- update(nhanesdes, htn = (BPXSAR>140) | (BPXDAR>90))

Linear regression example

coef(summary(model<-svyglm(BPXSAR~RIAGENDR+RIDAGEYR+factor(RIDRETH1)
                    +BMXBMI+sodium+potassium, 
               design=nhanesdes)))
##                     Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)       97.0461695 1.38920535 69.8573247 2.353878e-26
## RIAGENDR          -3.3704785 0.38506925 -8.7529152 1.890904e-08
## RIDAGEYR           0.4650764 0.01144886 40.6220626 1.904598e-21
## factor(RIDRETH1)2  0.2377258 1.35465032  0.1754886 8.623768e-01
## factor(RIDRETH1)3 -0.5100238 0.62819669 -0.8118855 4.259650e-01
## factor(RIDRETH1)4  3.0297383 0.64395692  4.7048773 1.206576e-04
## factor(RIDRETH1)5  1.2946580 0.88675393  1.4599969 1.590902e-01
## BMXBMI             0.3710158 0.03806145  9.7478105 3.024477e-09
## sodium             0.4287925 0.16189591  2.6485690 1.502436e-02
## potassium         -0.8499141 0.17133145 -4.9606429 6.577844e-05

Is the relationship nonlinear?

termplot(model,terms=5,partial=TRUE,smooth=panel.smooth)

Perhaps age is nonlinear?

library(splines)
model2<-svyglm(BPXSAR~RIAGENDR*ns(RIDAGEYR,4)+factor(RIDRETH1)
                    +BMXBMI+sodium+potassium, 
               design=nhanesdes)
coef(summary(model2))[c("sodium","potassium"),]
##             Estimate Std. Error   t value     Pr(>|t|)
## sodium     0.3081594  0.1567208  1.966295 0.0694138309
## potassium -0.7229498  0.1636270 -4.418278 0.0005839182

No real change. Weak association may be due to measurement error.

Some tests

AIC(model,model2)
##         eff.p     AIC deltabar
## [1,] 4859.923 3230615 539.9914
## [2,] 7586.726 3103323 474.1704
regTermTest(model2, ~sodium+potassium)
## Wald test for sodium potassium
##  in svyglm(formula = BPXSAR ~ RIAGENDR * ns(RIDAGEYR, 4) + factor(RIDRETH1) + 
##     BMXBMI + sodium + potassium, design = nhanesdes)
## F =  9.770038  on  2  and  14  df: p= 0.0022077

regTermTest(model2, ~factor(RIDRETH1),method="Wald")
## Wald test for factor(RIDRETH1)
##  in svyglm(formula = BPXSAR ~ RIAGENDR * ns(RIDAGEYR, 4) + factor(RIDRETH1) + 
##     BMXBMI + sodium + potassium, design = nhanesdes)
## F =  14.83965  on  4  and  14  df: p= 0.000061444
regTermTest(model2, ~factor(RIDRETH1),method="LRT")
## Working (Rao-Scott+F) LRT for factor(RIDRETH1)
##  in svyglm(formula = BPXSAR ~ RIAGENDR * ns(RIDAGEYR, 4) + factor(RIDRETH1) + 
##     BMXBMI + sodium + potassium, design = nhanesdes)
## Working 2logLR =  41.46407 p= 0.00062401 
## (scale factors:  1.5 1.3 0.75 0.46 );  denominator df= 14

Tests in regression

Basic idea: Rao & Scott (1981,1984) work out the sampling distribution of the pseudolikelihood ratio and the Pearson \(X^2\) score statistic in contingency tables

Lumley & Scott (2015) extend this to generalised linear models with arbitrary covariates.

Quadratic forms have a known (messy, but computable) distribution

AIC

AIC estimates out-of-sample prediction error, which still makes sense under complex sampling, but the penalty changes.

Rather than \(-2\ell+2p\), use \(-2\hat\ell+2\hat\Delta\), where \(\hat\Delta\) is the trace of a design-effect matrix.

Under iid sampling, correct model, eigenvalues are 1 or 0 and \(\Delta=p\).

Under complex sampling, we can estimate it: if \(A^{-1}BA^{-1}\) is sandwich estimator, \(\hat\Delta\) is trace of \(A^{-1}B\)

BIC

Can’t just plug in the pseudolikelihood: BIC approximates posterior model probabilities, so we aren’t allowed to cheat.

Comes out as a penalised Wald statistic, with penalty \(p\log n^*\) for an effective sample size \(n^*\).

(Lumley & Scott, J. Survey Stat Methodology, 2016)

Maps

R has some maps, can handle shapefiles

brfss<-update(brfss, agegp=cut(AGE, c(0,35,50,65,Inf)))
hlth<-svyby(~I(HLTHPLAN==1), ~agegp+X_STATE, svymean, 
    design=brfss)
    
hlthdata<-reshape(hlth[,c(1,2,4)],idvar="X_STATE",
    direction="wide",timevar="agegp")    
names(hlthdata)[2:5]<-paste("age",1:4,sep="")
    
states@data<-merge(states,hlthdata,
    by.x="ST_FIPS",by.y="X_STATE",all=FALSE)
spplot(states,c("age1","age2","age3","age4"), 
    names.attr=c("<35","35-50","50-65","65+"))

Insurance